SW/HW partitioning
by gil_savir on Sep 20, 2009
Posts: 59 Joined: Dec 7, 2008 Last seen: May 10, 2021
Hi all.
Now that we have some profiling files, I think we should start discussing SW/HW partitioning. I uploaded to the repository a profiling file (/oc-h264-encoder/trunk/doc/x264_profiling/gmon_files/round_300_frames_gmon.sum) that sums up profiling info from about 20 different profiling files in the repository, covering about 300 frames. Inspecting this file and mobcal_HD_1280x720_gmon.out, it appears that the four most compute-intensive functions are:
1. x264_pixel_satd_8x4()
2. mc_chroma()
3. get_ref()
4. x264_pixel_sad_x4_16x16()
I think we should choose one or two of these functions and implement them in hardware. To do this, we first need to decide on and implement the SoC architecture. At this point we need help from people with knowledge of the OR1K processor; please give us your input.
Another issue on the table is the environment we are going to use for development. I see two options:
1. Work on development boards (if available).
2. Interface the OR1K simulator to a Verilog simulator and develop in a simulation environment.
Please suggest other possible development environments, and please comment. - gil
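Since these kernels are the candidates for HW, it may help to see what the simplest of them actually computes. Below is a plain-C sketch of a single 16x16 SAD; note that x264_pixel_sad_x4_16x16 evaluates four candidate blocks against one source block at once and is heavily vectorized, so this is only the core operation, not x264's implementation:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over one 16x16 block.
 * Strides are in pixels. x264's x264_pixel_sad_x4_16x16 runs this
 * against four candidate blocks at once; this sketch shows a single
 * candidate in scalar form. */
static int sad_16x16(const uint8_t *src, int src_stride,
                     const uint8_t *ref, int ref_stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += abs(src[y * src_stride + x] - ref[y * ref_stride + x]);
    return sum;
}
```

In HW this maps naturally onto an absolute-difference array feeding an adder tree, which is one reason these functions are attractive partitioning candidates.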
RE: SW/HW partitioning
by gijoprems on Sep 23, 2009
Posts: 13 Joined: Jun 17, 2008 Last seen: Jan 8, 2014
As Gil mentioned, can we get some information on OR1K from any of the OpenCores members?
For the development environment, I think the second option would be feasible, at least to start with. As we progress, we can move on to an actual development board. To set up the OR1K simulator and link it with a Verilog simulator, some OR1K experts should lead this, I guess. -Gijo
RE: SW/HW partitioning
by marcus.erlandsson on Sep 23, 2009
Posts: 38 Joined: Nov 22, 2007 Last seen: Mar 7, 2013
Hi,
Excellent, we will check out the profiling results. There are multiple ways to run simulations using the OpenRISC processor; I suggest that we start with arch-simulator + verilog-rtl-simulator + verilator-simulation. BR, Marcus
RE: SW/HW partitioning
by gshankara on Sep 23, 2009
Posts: 14 Joined: Aug 30, 2008 Last seen: Oct 20, 2009
Is there some kind of README that we could use to set up the simulation/emulation platform on our own systems?
Thanks, Guru
RE: SW/HW partitioning
by gil_savir on Sep 23, 2009
Posts: 59 Joined: Dec 7, 2008 Last seen: May 10, 2021
> There are multiple ways to run simulations using the OpenRISC processor; I suggest that we start with arch-simulator + verilog-rtl-simulator + verilator-simulation.
I've stumbled upon the OR SoC project: http://www.opencores.org/openrisc,orpsocv2 Could this be a starting point for developing the H.264 encoder? - gil
RE: SW/HW partitioning
by marcus.erlandsson on Sep 24, 2009
Posts: 38 Joined: Nov 22, 2007 Last seen: Mar 7, 2013
> I've stumbled upon the OR SoC project: http://www.opencores.org/openrisc,orpsocv2 Could this be a starting point for developing the H.264 encoder? - gil
Yes, ORPSoC is a good starting point. We will then create a modified version as soon as we have decided which function to implement in HW. /Marcus
RE: SW/HW partitioning
by jackoc on Sep 26, 2009
Posts: 13 Joined: Sep 5, 2009 Last seen: May 16, 2010
Hi, all
Besides the simulation platform, maybe there is another important thing for us to do after the HW/SW partitioning: construct a C/C++ model based on x264 that has the same HW/SW partition as the real system, and the same hardware module partition. It can help us understand the overall encoder more accurately, and it provides a reference model (and reference data) for both HW and SW. Correct me if I missed something.
As far as the HW/SW partition is concerned, I suggest the following points, now that the profiling information has been analysed:
1) Before the mode of an MB is known, the calculation-intensive operations are related to the Inter/Intra MB mode decision functions; "x264_macroblock_analyse" is the top function.
2) After the mode is known, "x264_macroblock_encode" is the top function called to encode one macroblock; it includes mc, dct/idct, iq/q.
Thus, I propose that "x264_macroblock_analyse" and "x264_macroblock_encode" be implemented in hardware. "x264_macroblock_analyse" includes motion estimation/compensation for Inter MBs and mode decision for Intra MBs. "x264_macroblock_encode" includes dct/idct, q/iq, reconstruction, etc.
To sum up, I suggest that, after QP is decided by rate control, hardware should independently decide a macroblock's mode (inter (4x4, 4x8, ..., reference index) / intra (4x4, 4x8, ..., vertical/diagonal...)), then do dct/idct, q/iq, and reconstruction, and encode all the information into syntax elements by calling the variable-length bitstream encoder. So the macroblock is the basic unit for hardware to encode, and all tasks other than the above are left to software. Maybe a good option is for the hardware to independently encode several MBs after QP is configured by software.
I think that will be an optimization. Please correct me if I misunderstand. Regards, Jack
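To give a concrete feel for the dct/idct part of "x264_macroblock_encode" mentioned above, here is the textbook butterfly form of the H.264 4x4 forward integer transform in plain C. x264's own version is organized differently for SIMD and folds the scaling into the quantizer, so this sketch only illustrates the arithmetic:

```c
#include <stdint.h>

/* H.264 4x4 forward integer transform applied to a residual block,
 * in the standard butterfly form: rows first, then columns. */
static void dct4x4(int16_t in[4][4], int16_t out[4][4])
{
    int16_t tmp[4][4];

    /* transform rows */
    for (int i = 0; i < 4; i++) {
        int a = in[i][0] + in[i][3];
        int b = in[i][1] + in[i][2];
        int c = in[i][1] - in[i][2];
        int d = in[i][0] - in[i][3];
        tmp[i][0] = (int16_t)(a + b);
        tmp[i][2] = (int16_t)(a - b);
        tmp[i][1] = (int16_t)(2 * d + c);
        tmp[i][3] = (int16_t)(d - 2 * c);
    }
    /* transform columns */
    for (int j = 0; j < 4; j++) {
        int a = tmp[0][j] + tmp[3][j];
        int b = tmp[1][j] + tmp[2][j];
        int c = tmp[1][j] - tmp[2][j];
        int d = tmp[0][j] - tmp[3][j];
        out[0][j] = (int16_t)(a + b);
        out[2][j] = (int16_t)(a - b);
        out[1][j] = (int16_t)(2 * d + c);
        out[3][j] = (int16_t)(d - 2 * c);
    }
}
```

Because the transform is adds, subtracts, and shifts only, it is a natural fit for a small fixed-function datapath.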
RE: SW/HW partitioning
by gil_savir on Sep 27, 2009
Posts: 59 Joined: Dec 7, 2008 Last seen: May 10, 2021
> ...construct a C/C++ model based on x264 that has the same HW/SW partition as the real system, and the same hardware module partition. [...]
I agree. I think building such a model should start once a stable implementation of the core is available.
> Thus, I propose that "x264_macroblock_analyse" and "x264_macroblock_encode" be implemented in hardware. [...]
It looks like a good and modular SW/HW partitioning. However, we should decide whether we are going to:
1. try to do the partitioning in one big step, or
2. first move a small part of the SW to HW (one or two functions), and move on once that is stable.
Personally, I prefer the 2nd approach. It will allow us to concentrate on the design of the SoC's architecture. Once a stable system with a small HW partition is achieved, porting more SW functions to HW will be much easier. - gil
RE: SW/HW partitioning
by jackoc on Sep 27, 2009
Posts: 13 Joined: Sep 5, 2009 Last seen: May 16, 2010
> However, we should decide if we are going to: 1. try to do the partitioning in one big step, or 2. first try to move small part of the SW to HW (one or two functions), and when this will be stable, move on.
Hi, gil
Thanks for your response.
From my point of view, this project consists of two parts: one is to develop an H.264 encoder IP core which contains HW (in Verilog) and SW (firmware/driver); the other is to construct an SoC platform based on OpenRISC to run the firmware of the encoder (let's simply neglect other components in the SoC). Then it's clear that two tasks exist:
1) Design of the H.264 encoder IP core, including HW and SW;
2) Construction of an SoC platform (or simulation environment) to simulate the encoder; at a minimum it may include or1k/oc-h264-encoder/ddrc/ddr_chip.
I think we should not mix these two tasks: the 1st is to design the IP core, and the 2nd is to verify the IP core. After the encoder IP is ready for use, we could construct the real SoC, which will include other IP, like camera/dma/wb_ctrl... To my way of thinking, at this time we should concentrate on design of the IP, not the SoC.
The 2nd proposal you suggest is an easy way, but it will make our work repetitive and piecemeal, and things will deadlock without a clear understanding of H.264 and x264. I suggest the following way to implement the encoder:
1) Read through the H.264 standard and x264, then do the HW/SW partition according to the profiling info.
2) Work out an architecture specification of the IP, which includes register definitions, the HW/SW partition, the HW architecture definition (module partition and signal definitions), the SW flow, etc.
3) Develop a C model with an architecture similar to the HW, based on x264 and the architecture specification, to offer reference data for HW verification and even for early SW development.
4) Develop the RTL Verilog based on the architecture specification.
5) Build a simulation platform using OpenRISC.
6) Test and verification.
7) Done.
One or more of the steps above may be done simultaneously by different teams.
Regarding the SW/HW partition, I think implementing "x264_macroblock_analyse" and "x264_macroblock_encode" in hardware is reasonable: the 1st includes difference-calculation functions such as ssd/sad/satd plus motion estimation/compensation; the 2nd includes dct/idct/iq/q. These are all calculation-intensive functions according to the profiling info. Best Regards, Jack
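As a purely hypothetical illustration of what the register-definition part of step 2) might pin down, here is a sketch in C. Every register name, field, and bit below is invented for illustration; the real map is exactly what the architecture specification has to define:

```c
#include <stdint.h>

/* Hypothetical memory-mapped register layout for the encoder core.
 * All names and fields here are illustrative placeholders only. */
typedef struct {
    volatile uint32_t ctrl;     /* start/stop, interrupt enable */
    volatile uint32_t status;   /* busy, done, error flags */
    volatile uint32_t qp;       /* quantization parameter from SW rate control */
    volatile uint32_t mb_count; /* number of MBs to encode per kick */
    volatile uint32_t src_addr; /* source frame base address */
    volatile uint32_t ref_addr; /* reference frame base address */
    volatile uint32_t out_addr; /* output bitstream base address */
} h264_enc_regs;

/* Illustrative bit assignments. */
enum {
    ENC_CTRL_START  = 1u << 0,
    ENC_STATUS_DONE = 1u << 0
};
```

The point of writing this down early is that firmware, the C model, and the RTL all depend on it, so freezing it in the specification avoids the frequent interface churn Jack warns about below.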
RE: SW/HW partitioning
by gil_savir on Sep 27, 2009
Posts: 59 Joined: Dec 7, 2008 Last seen: May 10, 2021
Hi Jack,
I'm not sure I understand what the C model you refer to in step 3) is. x264 is already implemented in C. In order to model its HW behavior in C, will we need to model the or1k processor in C as well (or use an existing or1k model)? Could you please explain this model in more detail?
Do you think the RTL Verilog development in step 4) should be done at module level, separated from the SW: verified at module level, then integrated, and then verified at system level? - gil
RE: SW/HW partitioning
by jackoc on Sep 28, 2009
Posts: 13 Joined: Sep 5, 2009 Last seen: May 16, 2010
> I'm not sure I understand what the C model you refer to in step 3) is. [...] Could you please explain this model in more detail?
As far as I know, there is a SystemC TLM2.0 block interface attached to or1ksim, so we don't need to redevelop or1k; we can run firmware directly on or1ksim. What we need is to write a C/C++ model that is compatible with TLM and has an architecture similar to the HW. With this model, we can 1) do early software development, 2) offer reference data for HW, and 3) make the tasks of the HW and SW teams clear.
The reason not to use x264 directly is that 1) the data structures of x264 are hard to implement in HW, and 2) some algorithms of x264 are hard to implement in HW, such as rate control and motion search.
> Do you think the RTL Verilog development in step 4) should be done at module level, separated from the SW: verified at module level, then integrated, and then verified at system level?
Yes, I prefer this flow, though the SW may run on or1k. Taking the step-by-step route would make the HW architecture change very frequently, for example the register interface. Regards, Jack
RE: SW/HW partitioning
by toanfxt on Sep 29, 2009
Posts: 4 Joined: Jun 24, 2008 Last seen: Sep 18, 2017
Hi gurus,
I agree with jackoc that "x264_macroblock_analyse" and "x264_macroblock_encode" should be implemented in hardware. Your design-flow proposal looks clear, from high-level modeling to verification. In my opinion, after we all accept the HW/SW partition, we need to define our algorithms (ME, MC, mode selection, ...) so that they can be implemented in hardware, and the algorithms must have similar, or at least acceptable, compression efficiency in comparison with x264. Regards, toanfxt
RE: SW/HW partitioning
by gil_savir on Sep 29, 2009
Posts: 59 Joined: Dec 7, 2008 Last seen: May 10, 2021
> I agree with jackoc that "x264_macroblock_analyse" and "x264_macroblock_encode" should be implemented in hardware. Your design-flow proposal looks clear, from high-level modeling to verification.
I agree as well. We might consider adding x264_slicetype_analyse to that list: it consumes over 5% of CPU time, and it uses some of the same functions as x264_macroblock_analyse (sad, satd), which may allow us to reuse HW.
> In my opinion, after we all accept the HW/SW partition, we need to define our algorithms (ME, MC, mode selection, ...) so that they can be implemented in hardware, and the algorithms must have similar or acceptable compression efficiency in comparison with x264.
I think, however, that our default algorithm choices should follow x264's algorithm choices. There are very few scenarios where an algorithm implemented on a GPP will outperform the same algorithm implemented in HW. I suggest that where a different algorithm is required for the HW implementation (for better performance or any other reason), the choice should be made by the person/team assigned to implement that function in HW; debating every algorithm among all team members might stall our progress. Of course, every member should be able to give input on algorithm choices.
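For reference, satd, which is shared by x264_macroblock_analyse and the slicetype decision, is a SAD computed on the Hadamard-transformed difference block. A plain-C 4x4 sketch follows; x264's versions cover larger sizes, are vectorized, and (as far as I recall) halve the raw Hadamard sum as done here:

```c
#include <stdint.h>
#include <stdlib.h>

/* 4x4 SATD: difference block, 2D Hadamard transform, sum of absolute
 * values, halved (following x264's convention). Strides in pixels. */
static int satd_4x4(const uint8_t *a, int astride,
                    const uint8_t *b, int bstride)
{
    int d[4][4], m[4][4], sum = 0;

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            d[i][j] = a[i * astride + j] - b[i * bstride + j];

    /* horizontal Hadamard */
    for (int i = 0; i < 4; i++) {
        int s01 = d[i][0] + d[i][1], s23 = d[i][2] + d[i][3];
        int d01 = d[i][0] - d[i][1], d23 = d[i][2] - d[i][3];
        m[i][0] = s01 + s23; m[i][1] = s01 - s23;
        m[i][2] = d01 + d23; m[i][3] = d01 - d23;
    }
    /* vertical Hadamard, accumulating absolute values */
    for (int j = 0; j < 4; j++) {
        int s01 = m[0][j] + m[1][j], s23 = m[2][j] + m[3][j];
        int d01 = m[0][j] - m[1][j], d23 = m[2][j] - m[3][j];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }
    return sum >> 1;
}
```

The butterfly structure is identical in both passes, so a HW implementation can reuse one Hadamard stage for rows and columns, and the same unit serves both MB analysis and slicetype lookahead.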
RE: SW/HW partitioning
by jackoc on Sep 29, 2009
Posts: 13 Joined: Sep 5, 2009 Last seen: May 16, 2010
Hi,
Then we may have agreement on the following points:
1) "x264_macroblock_analyse", "x264_macroblock_encode" and "x264_slicetype_analyse" will be implemented by HW.
2) The basic encoding unit for HW is a macroblock.
Should we encode several macroblocks (with the number of MBs configurable by SW) at a time to enhance throughput? A different QP for each MB may improve coding efficiency, but at the price of more complex HW.
Maybe we should begin to draft an architecture definition including the following:
1) basic parameters (technology process, expected frequency, throughput, max supported picture size, ...);
2) register definitions;
3) encoder operation flow (how HW and SW cooperate);
...please add on.
By the way, I'm currently not a member of this project; could anyone please add me? Regards, Jack
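On the per-MB QP point: in H.264 the quantizer step size doubles every 6 QP values, which is what makes QP granularity a real rate-control lever. A scalar sketch of that relationship follows; the Qstep progression comes from the standard, but the normative quantizer instead uses per-position multiplier tables, so this is illustrative only:

```c
#include <stdlib.h>

/* Qstep * 16 for QP % 6 = 0..5, i.e. 0.625, 0.6875, 0.8125,
 * 0.875, 1.0, 1.125 in fixed point; the step doubles every 6 QP. */
static const int qstep_x16[6] = { 10, 11, 13, 14, 16, 18 };

static int qstep_times16(int qp)
{
    return qstep_x16[qp % 6] << (qp / 6);
}

/* Scalar quantization of one coefficient, rounding toward zero for
 * simplicity (the real quantizer adds a rounding offset and folds in
 * the transform scaling). */
static int quantize(int coef, int qp)
{
    int q16 = qstep_times16(qp);
    int sign = coef < 0 ? -1 : 1;
    return sign * ((abs(coef) * 16) / q16);
}
```

A QP fixed per group of MBs keeps the HW simple (one register write per kick), while per-MB QP would push part of the rate-control loop into the HW/SW handshake.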
RE: SW/HW partitioning
by gil_savir on Sep 29, 2009
Posts: 59 Joined: Dec 7, 2008 Last seen: May 10, 2021
> 1) "x264_macroblock_analyse", "x264_macroblock_encode" and "x264_slicetype_analyse" will be implemented by HW.
Agreed. Note that this means almost all functionality will be implemented in HW, and the CPU will mainly be used for control.
> By the way, I'm currently not a member of this project; could anyone please add me?
Done. Let us take up the other points in a separate thread on architecture decisions. - gil